Characterizing Weblog Corpora
Abstract
To exploit the huge volume of information published in the blogosphere, it is essential to provide techniques such as clustering that can automatically analyze and classify its contents. However, these techniques typically produce better results on wide-domain, full-text documents. In most cases, blogs can be considered “short texts”: they are not extensive documents, and they exhibit characteristics that are undesirable from a clustering perspective, such as low term frequencies, small vocabulary size, and vocabulary overlap between some domains. Furthermore, their characteristics vary widely depending on the writer's specific interests, linguistic style, and the volume of text produced. In this work, we present a set of evaluation features for establishing the relative hardness of the clustering task, i.e., how easy or difficult it will be to cluster the blog datasets accurately. These features are shortness, domain broadness, class imbalance, stylometry, and structure. We report results obtained on corpora extracted from two popular blogging sites, Boing Boing (“B-B”) and Slashdot. The results are contrasted with characterizations of a number of other corpora consisting of newspaper articles and academic papers. These results can inform the choice of the most appropriate clustering methodology.
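As a rough illustration of what corpus features of this kind might look like in practice, the sketch below computes two simple statistics: average document length and vocabulary size (as proxies for shortness), and a normalized spread of class sizes (as a proxy for class imbalance). The abstract does not give the paper's formulas, so the function names, the tokenizer, and the normalization used here are assumptions chosen for illustration, not the authors' definitions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs (an assumption).
    return re.findall(r"[a-z0-9']+", text.lower())


def shortness_stats(docs):
    # Average tokens per document and overall vocabulary size.
    token_lists = [tokenize(d) for d in docs]
    total_tokens = sum(len(toks) for toks in token_lists)
    vocab = {tok for toks in token_lists for tok in toks}
    return {
        "avg_doc_length": total_tokens / len(token_lists),
        "vocab_size": len(vocab),
    }


def class_imbalance(labels):
    # Population standard deviation of class sizes, divided by the
    # number of documents. 0.0 means perfectly balanced classes.
    # This normalization is an illustrative assumption.
    counts = Counter(labels)
    expected = len(labels) / len(counts)
    variance = sum((c - expected) ** 2 for c in counts.values()) / len(counts)
    return math.sqrt(variance) / len(labels)


if __name__ == "__main__":
    posts = ["New gadget review", "Open source kernel patch released today"]
    print(shortness_stats(posts))
    print(class_imbalance(["tech", "tech", "tech", "culture"]))
```

Lower average length and smaller vocabulary would suggest a "shorter" corpus, and a larger imbalance value would suggest more uneven categories; how such numbers map to clustering hardness is the subject of the paper itself.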
Related resources
Building Emotion Lexicon from Weblog Corpora
An emotion lexicon is an indispensable resource for emotion analysis. This paper aims to mine the relationships between words and emotions using weblog corpora. A collocation model is proposed to learn emotion lexicons from weblog articles. Sentence-level emotion classification experiments using the mined lexicons demonstrate their usefulness.
Identifying Personal Narratives in Chinese Weblog Posts
Automated text classification technologies have enabled researchers to amass enormous collections of personal narratives posted to English-language weblogs. In this paper, we explore analogous approaches to identify personal narratives in Chinese weblog posts as a precursor to the future empirical studies of cross-cultural differences in narrative structure. We describe the collection of over h...
Minimal Narrative Annotation Schemes and Their Applications
The increased use of large corpora in narrative research has created new opportunities for empirical research and intelligent narrative technologies. To best exploit the value of these corpora, several research groups are eschewing complex discourse analysis techniques in favor of high-level minimalist narrative annotation schemes that can be quickly applied, achieve high inter-rater agreement,...
Leave a Reply: An Analysis of Weblog Comments
Access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. This overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. In this paper we present a large-scale study of weblog comments and their relation to the posts. Using a...
Publication date: 2009